Analysis and visualization of WineEnthusiast wine reviews¶

Author: Manuele Nolli, student BSc Computer Science SUPSI

Date: 28.11.2022

Mail: manuele.nolli@student.supsi.ch

Introduction¶

This document analyzes a public dataset found on Kaggle.com.

The dataset contains about 80k wine reviews with variety, location, winery, price, points, taster name and description.

My analysis will focus on the following questions:

  • Where are the wines produced?
  • What is the distribution of the points?
  • What is the distribution of the prices, and is it related to the points?
  • What is the distribution of the variety of wines?
  • How many tasters are there and how many reviews has each of them written?
    • Are there tasters that are more reliable than others?
    • Do the tasters have a preference for a specific continent/country?
  • What are the most common words in the description of the wines?

Notebook setup¶

In [1]:
import numpy as np
import plotly.express as px
import plotly.graph_objs as go
import pandas as pd
from plotly.subplots import make_subplots
df=pd.read_csv("data/winemag2017-2020/winemag2017-2020.csv")

Dataset details¶

The following code shows the details of the dataset: how it is structured, the type of each column and the number of missing values per column.

In [2]:
print(f"---Dataset Info---")

#printing column names
print(f"Total columns: {len(df.columns)}")
print("Columns names:", end=" ")
for col in df:
    if col == 'winery':
        print(col, end=".")
    else: 
        print(col, end=", ")
print()

#columns types
print(f"Columns type:")

#creating temp array
columnData = []
dfIndexType = []

for col in df.columns:
    temp = []
    dfIndexType.append(col)
    temp.append(df[col].apply(type).unique())
    temp.append(df[col].isnull().sum())
    columnData.append(temp)

#create new Dataframe
dfColumnsType = pd.DataFrame(columnData, columns=['Types','NaN Count'])
dfColumnsType.index = dfIndexType
#print columns type
display(dfColumnsType)

#df size
print(f"Dataframe rows: {len(df)}")

#df sample
print("Dataset samples:")
df.sample(5)
---Dataset Info---
Total columns: 15
Columns names: country, description, designation, points, price, province, region_1, region_2, taster_name, taster_photo, taster_twitter_handle, title, variety, vintage, winery.
Columns type:
Types NaN Count
country [<class 'str'>, <class 'float'>] 5
description [<class 'str'>] 0
designation [<class 'str'>, <class 'float'>] 21319
points [<class 'int'>] 0
price [<class 'float'>] 4647
province [<class 'str'>, <class 'float'>] 5
region_1 [<class 'float'>, <class 'str'>] 12913
region_2 [<class 'float'>, <class 'str'>] 49894
taster_name [<class 'str'>, <class 'float'>] 150
taster_photo [<class 'str'>, <class 'float'>] 150
taster_twitter_handle [<class 'str'>, <class 'float'>] 1076
title [<class 'str'>] 0
variety [<class 'str'>] 0
vintage [<class 'str'>] 0
winery [<class 'str'>] 0
Dataframe rows: 81115
Dataset samples:
Out[2]:
country description designation points price province region_1 region_2 taster_name taster_photo taster_twitter_handle title variety vintage winery
78071 US This Cabernet Franc rosé shows a pleasing inte... Dry 90 17.0 New York Finger Lakes Finger Lakes Alexander Peartree https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @apatrone23 Lamoreaux Landing 2019 Dry Rosé (Finger Lakes) Rosé 2019 Lamoreaux Landing
75848 US Tannic and oaky, this is rough-hewn with flavo... Estate 85 30.0 Oregon Applegate Valley Southern Oregon Paul Gregutt https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @paulgwine Schmidt 2016 Estate Tempranillo (Applegate Val... Tempranillo 2016 Schmidt
8932 US This was barrel fermented, and picked up a fai... NaN 87 16.0 Oregon Elkton Oregon Southern Oregon Paul Gregutt https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @paulgwine River's Edge 2015 Pinot Gris (Elkton Oregon) Pinot Gris 2015 River's Edge
52723 France Produced from old vines in a single vineyard, ... Croix de Montceau 90 35.0 Burgundy Saint-Véran NaN Roger Voss https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @vossroger Vignerons des Terres Secrètes 2016 Croix de Mo... Chardonnay 2016 Vignerons des Terres Secrètes
30940 Italy Camphor, cedar, coconut and dried aromatic her... Badarina 92 90.0 Piedmont Barolo NaN Kerin O’Keefe https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @kerinokeefe Grimaldi Bruna 2015 Badarina (Barolo) Nebbiolo 2015 Grimaldi Bruna

The output shows that the dataset contains 81,115 rows (about 80k) and 15 columns. The columns are:

  • country: the country of origin of wine
  • description: a few sentences from a sommelier describing the wine's taste, smell, look, feel, etc.
  • designation: the vineyard within the winery where the grapes that made the wine are from
  • points: the number of points WineEnthusiast rated the wine on a scale of 1-100 (though they say they only post reviews for wines that score >=80)
  • price: the cost for a bottle of the wine
  • province: the province or state that the wine is from
  • region_1: the wine growing area in a province or state (e.g. Napa)
  • region_2: sometimes there are more specific regions specified within a wine growing area (e.g. Rutherford inside the Napa Valley), but this value can sometimes be blank
  • taster_name: name of the person who tasted and reviewed the wine
  • taster_photo: url of the taster's photo
  • taster_twitter_handle: Twitter handle for the person who tasted and reviewed the wine
  • title: the title of the wine review
  • variety: the type of grapes used to make the wine (ie Pinot Noir)
  • vintage: the vintage of the wine
  • winery: the winery that made the wine
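
The per-column type and missing-value table built above can also be obtained with built-in pandas methods. The following is only a minimal sketch on a small, made-up frame (the `toy` data is an assumption for illustration and is not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame that mimics three of the dataset's columns
toy = pd.DataFrame({
    "country": ["US", None, "Italy"],
    "points": [90, 85, 92],
    "price": [17.0, np.nan, 90.0],
})

# One row per column: pandas dtype and number of missing values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "NaN Count": toy.isna().sum(),
})
print(summary)
```

Unlike the cell above, this reports the pandas dtype of the column rather than the Python types of the single values, so a text column with missing values appears as one dtype instead of the pair str/float.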

Start Analysis¶

Distribution of wines across continents¶

This section shows the distribution of the wines across the continents. It is based on the country column: I created a new column called continent that contains the continent of each country.

In [3]:
#Continent list
europe = ['Austria', 'Bosnia and Herzegovina','Bulgaria','Croatia','Cyprus','Czech Republic','England', 'France','Germany','Greece','Italy','Luxembourg','Portugal','Hungary', 'Macedonia', 'Moldova', 'Romania', 'Serbia', 'Slovakia', 'Slovenia', 'Spain', 'Switzerland', 'Turkey', 'Ukraine']
asia = ['Armenia', 'China','India','Israel','Lebanon' ]
northAmerica = ['Canada','US','Mexico']
sudAmerica = ['Argentina','Brazil','Chile','Peru','Uruguay']
oceania = ['Australia','New Zealand'] 
africa = ['South Africa','Morocco']
other = ['Egypt', 'Georgia']

#All the continents with a small number of reviews are labeled as 'Other'
def continentDispacher(row):
    if row['country'] in europe:
        val = 'Europe'
    elif row['country'] in asia:
        #val = 'Asia'
        val = 'Other'
    elif row['country'] in northAmerica:
        val = 'North America'
    elif row['country'] in sudAmerica:
        #val = 'Sud America'
        val = 'Other'
    elif row['country'] in oceania:
        #val = 'Oceania'
        val = 'Other'
    elif row['country'] in africa:
        #val = 'Africa'
        val = 'Other'
    else:
        val = 'Other'

    return val

df['continent'] = df.apply(continentDispacher,1)
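
The same country-to-continent lookup can be written without a row-wise `apply`. This is a minimal sketch, assuming a dictionary built from the lists above; the `continent_by_country` name and the shortened country list are hypothetical and only serve the example:

```python
import pandas as pd

# Hypothetical subset of the country lists, for illustration only
continent_by_country = {
    "France": "Europe",
    "Italy": "Europe",
    "US": "North America",
    "Canada": "North America",
}

toy = pd.DataFrame({"country": ["France", "US", "Chile", None]})

# Look up each country; anything not in the dict (including missing values) becomes 'Other'
toy["continent"] = toy["country"].map(continent_by_country).fillna("Other")
print(toy)
```

On a frame of this size the difference is negligible, but `Series.map` usually scales better than `DataFrame.apply` over rows.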

The following code shows the distribution of the wines across the continents through a pie chart. It is possible to see that the majority of the wines are produced in Europe, followed by North America.

In [4]:
#Distribution of the wines by continent
pieContinent = px.pie(df, names='continent', title='Distribution of wines across continents')
pieContinent.update_traces(textposition='inside', textinfo='percent+label')
pieContinent.update(layout_showlegend=False)

#update layout for export
"""
pieContinent.update_layout(
    title={
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        font=dict(
        size=18),
        height=1000,
        width=1000)
"""

pieContinent.show()
In [5]:
#group by country to get the number of reviews per country
dfCountry = df.groupby('country').count().reset_index()
dfCountry = dfCountry[['country','continent']]
dfCountry.columns = ['country','count']

#display dfCountry in a maps
fig = px.choropleth(dfCountry, locations="country", locationmode='country names', color="count", hover_name="country", color_continuous_scale=px.colors.sequential.Plasma)

#more realistic map
fig.update_geos(projection_type="natural earth")

#update layout for enlarge the map
fig.update_layout(margin={"r":0,"t":50,"l":0,"b":0},title = 'Wine distribution across countries')

#update layout for export
"""
fig.update_layout(
    title={
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        font=dict(
        size=18),
        height=1000,
        width=2000)
"""

fig.show()
In [6]:
#groupby continent, country, region1 and region2 for have count
dfRegion = df.groupby(['continent','country','region_1','region_2'], dropna=False).count().reset_index()
dfRegion = dfRegion[['continent','country','region_1','region_2','points']]
dfRegion.columns = ['continent','country','region_1','region_2','count']

#I could not find a better way to show the rows where region_1 or region_2 is null
dfRegion.fillna('None', inplace=True)

fig = px.treemap(dfRegion, path=["continent", 'country', 'region_1', 'region_2'],branchvalues="total", values='count', title='Wine distribution across countries')
fig.show()

#create a sunburst chart
fig = px.sunburst(dfRegion, path=["continent", 'country', 'region_1', 'region_2'], values='count', title='Wine distribution across countries')

The above chart is an alternative way to see the distribution of the wines across the continents. It is more interactive and it is possible to see the exact number of wines produced in each continent, country and region.

Points distribution¶

Another interesting aspect of the dataset is the distribution of the points. The points are given by the tasters on a scale from 80 to 100, and WineEnthusiast also groups the wines into six categories:

  • 80–82: ACCEPTABLE Can be employed
  • 83–86: GOOD Suitable for everyday consumption; often good value
  • 87–89: VERY GOOD Often good value; well recommended
  • 90–93: EXCELLENT Highly recommended
  • 94–97: SUPERB A great achievement
  • 98–100: CLASSIC The pinnacle of quality

In the following section a new column called pointsDescription is created that contains the description of the score.

In [7]:
#Create new column with points description
def pointsDispacher(points):
    #each limit is the first score of the next category (e.g. 90-93 is Excellent, 94-97 is Superb)
    if points < 83:
        val = 'Acceptable'
    elif points < 87:
        val = 'Good'
    elif points < 90:
        val = 'Very good'
    elif points < 94:
        val = 'Excellent'
    elif points < 98:
        val = 'Superb'
    else:
        val = 'Classic'

    return val

#Create new column with points description
df['pointsDescription'] = df['points'].map(pointsDispacher)
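
The score bands listed above can also be expressed as bin edges. The following is a minimal sketch with `pandas.cut`, using a hand-written series of scores as an assumption instead of the real points column:

```python
import pandas as pd

points = pd.Series([80, 82, 83, 86, 87, 89, 90, 93, 94, 97, 98, 100])

# Right-closed bins: (79, 82], (82, 86], (86, 89], (89, 93], (93, 97], (97, 100]
bins = [79, 82, 86, 89, 93, 97, 100]
labels = ["Acceptable", "Good", "Very good", "Excellent", "Superb", "Classic"]

description = pd.cut(points, bins=bins, labels=labels)
print(pd.DataFrame({"points": points, "pointsDescription": description}))
```

Writing the edges as one list makes it easy to compare them with the published bands, and the result is an ordered categorical column.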
In [8]:
#Histogram of points description
pointDistribution = px.histogram(df, x='points', color='pointsDescription', title='Points distribution', height=500,
 category_orders=dict(pointsDescription=['Classic', 'Superb', 'Excellent', 'Very good', 'Good','Acceptable']), 
                  labels={
                     "pointsDescription": "Point Description"
                 },
                 color_discrete_map = {'Classic':'#903f5c','Superb':'#006179','Excellent':'#008377','Very good':'#09a259', 'Good':'#90b827', 'Acceptable':'#ffbf00'}

)
#Update axis
pointDistribution.update_xaxes(title='Point',tickmode='linear')
pointDistribution.update_yaxes(title='Count')

#update layout for export
"""
pointDistribution.update_layout(
    title={
        'x':0.5,
        'xanchor': 'center',
        'yanchor': 'top'},
        font=dict(
        size=18),
        height=700,
        width=2000)
"""

pointDistribution.show()

From this graph it is possible to see that most of the wines are in the Good category, followed by the Very good category (the middle scores are the most common).

It is curious that there are more wines with 90 points than with 89 points. This is probably because tasters prefer to give a wine 90 points rather than 89 so that it is labeled as Excellent.

Vintage distribution¶

In this section it is possible to see the distribution of the vintage of the wines. The vintage is the year in which the grapes were harvested.

In [9]:
import datetime

dfVintageWithoutNaN = df.copy()

#Remove the 'NV' string (non-vintage): wines blended from grapes harvested in different years
dfVintageWithoutNaN = dfVintageWithoutNaN[dfVintageWithoutNaN['vintage'] != 'NV']

dfVintageWithoutNaN['vintage'] = pd.DatetimeIndex(dfVintageWithoutNaN['vintage']).year

#Removing impossible data
dfVintageWithoutNaN = dfVintageWithoutNaN[dfVintageWithoutNaN['vintage'] < datetime.datetime.now().year] #year in the future

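#Removing wines whose title contains a year that is not the vintage (assumption: a genuinely old wine costs at least 100 USD)
#The price conditions are grouped explicitly: '&' binds tighter than '|', so without the parentheses every wine with no price would be dropped
oldVintage = dfVintageWithoutNaN['vintage'] < 1980
cheapOrUnpriced = (dfVintageWithoutNaN['price'] < 100) | (dfVintageWithoutNaN['price'].isna())
dfVintageWithoutNaN = dfVintageWithoutNaN[~(oldVintage & cheapOrUnpriced)]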

#Histogram of vintage distribution
vintageDistribution = px.histogram(dfVintageWithoutNaN, x="vintage", title='Vintage review distribution')

#Update axis
vintageDistribution.update_xaxes(title='Year',dtick=1)
vintageDistribution.update_yaxes(title='Count')

vintageDistribution.show()
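
As an alternative to filtering out 'NV' and going through `DatetimeIndex`, the vintage strings can be converted to numbers directly. This is a minimal sketch on a made-up series, assuming every valid vintage is a plain four-digit year:

```python
import pandas as pd

vintage = pd.Series(["2019", "2016", "NV", "1931"])

# errors='coerce' turns anything that is not a number (here 'NV') into NaN
year = pd.to_numeric(vintage, errors="coerce")

# Keep only the rows with a usable year and store it as an integer
year = year.dropna().astype(int)
print(year.tolist())
```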

It must be remembered that the dataset contains wines reviewed between 2017 and 2020, so it is normal that most of the wines are from recent years. However, there are also some very old wines in the dataset. The oldest wine is from 1931 and, surprisingly, it does not have a very high score.

In [10]:
dfVintageWithoutNaN.loc[dfVintageWithoutNaN['vintage'] == 1931]
Out[10]:
country description designation points price province region_1 region_2 taster_name taster_photo taster_twitter_handle title variety vintage winery continent pointsDescription
2722 Portugal This remarkable wine looks old, and with its d... Tinto 89 550.0 Colares NaN NaN Roger Voss https://253qv1sx4ey389p9wtpp9sj0-wpengine.netd... @vossroger Adega Viuva Gomes 1931 Tinto Red (Colares) Ramisco 1931 Adega Viuva Gomes Europe Very good

Wine variety¶

This section shows the distribution of the wine varieties. The variety is the type of grapes used to make the wine (e.g. Pinot Noir). The dataset contains many different varieties, but I decided to show only the top 10. This setting can be changed through the wineCountToShow variable.

First, I created different versions of the dataset that will be used to create the graphs.

In [11]:
#Wine to be shown
wineCountToShow = 10

# Number of reviews per variety, from the most to the least reviewed
varietyCounts = df.groupby('variety').size().sort_values(ascending=False)

# Top {wineCountToShow} wine variety with the highest count
dfMostWineVariety = varietyCounts.head(wineCountToShow).reset_index(name='count')

# Other wine variety (everything after the top {wineCountToShow})
dfOtherWineVariety = varietyCounts.iloc[wineCountToShow:].reset_index(name='count')


#Create order of bars
order = dfMostWineVariety['variety'].tolist()
order.reverse()
order = ['Other'] + order

# Top {wineCountToShow} wine variety with the highest count and the price
dfFiltered = df.copy()
dfFiltered = dfFiltered.loc[df['variety'].isin(dfMostWineVariety['variety'])]
dfFilteredPoints = dfFiltered.groupby(['variety']).agg({'points': ['mean']}).reset_index()

# Other wine variety
dfFilteredOtherWine = df.loc[df['variety'].isin(dfOtherWineVariety['variety'])]
dfFilteredOtherWinePoints = dfFilteredOtherWine.groupby(['variety']).agg({'points': ['mean']}).reset_index()

Now it is finally time to create the graphs. The left graph is a bar chart that shows the number of reviews per variety, the center graph is another bar chart that shows the average points per variety and the right graph is a box plot that shows the distribution of the prices per variety.

In [12]:
groupbypoints = df.groupby(['pointsDescription','points']).size().reset_index(name='count')
topReviewedWines = make_subplots(rows=1, cols=3,subplot_titles=('Reviews count',"Variety average points","Price distribution"), shared_yaxes=True,horizontal_spacing = 0.025
)


#Variety average points 
trace1 = go.Bar(y=dfFilteredPoints['variety'], x=dfFilteredPoints['points']['mean'],orientation='h',marker_color='rgba(101, 109, 255, 1)')
trace2 = go.Bar(x=[dfFilteredOtherWinePoints['points']['mean'].mean()], y=['Other'], name='Other', orientation='h',marker_color='rgba(55, 83, 109, 0.6)')

#Wine reviews based on variety
trace3 = go.Bar(y=dfMostWineVariety['variety'], x=dfMostWineVariety['count'], name='Top variety', orientation='h',marker_color='rgba(101, 109, 255, 1)')

trace4 = go.Bar(x=[dfOtherWineVariety['count'].sum()], y=['Other'], name='Other', orientation='h',marker_color='rgba(55, 83, 109, 0.6)')

#Price distribution
trace5 = go.Box(x=dfFiltered['price'], y=dfFiltered['variety'], orientation='h',marker_color='rgba(101, 109, 255, 1)')

trace6 = go.Box(x=dfFilteredOtherWine['price'], name='Other', orientation='h',marker_color='rgba(55, 83, 109, 0.6)')

#Add traces
topReviewedWines.add_trace(trace1, row=1, col=2)
topReviewedWines.add_trace(trace2, row=1, col=2)
topReviewedWines.add_trace(trace3, row=1, col=1)
topReviewedWines.add_trace(trace4, row=1, col=1)
topReviewedWines.add_trace(trace5, row=1, col=3)
topReviewedWines.add_trace(trace6, row=1, col=3)

#General layout
topReviewedWines.update_yaxes(categoryorder='array',categoryarray=order)
topReviewedWines.update_layout(showlegend=False)
topReviewedWines.update_layout(title=f'[top {wineCountToShow}] Reviewed wines')

#update title yaxis
topReviewedWines.update_yaxes(title_text='Wine variety', row=1, col=1)

#left graph layout
topReviewedWines.update_xaxes(title_text="Count", col=1)
topReviewedWines.update_xaxes(dtick=5000, col=1)

#center graph layout
topReviewedWines.update_xaxes(title_text="Point",  col=2)
topReviewedWines.update_xaxes(range=[80, 100], col=2)
topReviewedWines.update_xaxes(dtick=2, col=2)
#right graph layout
topReviewedWines.update_xaxes(title_text="Price USD", col=3)
topReviewedWines.update_xaxes(type="log", range=[0,4],  col=3)

#update layout for export
"""
topReviewedWines.update_layout(
        font=dict(
        size=25),
        height=1000,
        width=3000)
topReviewedWines.update_layout(title_font_size=1)
topReviewedWines.update_annotations(font_size=50)
"""

#Finally show the graph
topReviewedWines.show()

It is interesting to see that the 'Other' group, taken as a whole, has many more reviews than any single top 10 variety. This suggests that the reviews are spread over many varieties rather than concentrated in a few.
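
Collapsing the less frequent varieties into a single 'Other' label can also be done on the column itself. This is a minimal sketch on made-up values; `topN` is a hypothetical stand-in for wineCountToShow:

```python
import pandas as pd

variety = pd.Series(["Pinot Noir", "Pinot Noir", "Chardonnay", "Chardonnay",
                     "Chardonnay", "Ramisco", "Nebbiolo"])
topN = 2  # hypothetical stand-in for wineCountToShow

# Names of the topN most frequent varieties
top = variety.value_counts().head(topN).index

# Keep the name when it is in the top list, otherwise replace it with 'Other'
grouped = variety.where(variety.isin(top), "Other")
print(grouped.value_counts())
```

With such a column, a single groupby gives the count, the mean points and the prices of the top varieties and of 'Other' at once. Note that the mean of 'Other' would then be weighted by the number of reviews, unlike the average of per-variety means used in the graph.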

Wine - Price connection¶

This section contains two main graphs. The first is a box plot representing the distribution of the prices for each score; the second is a percentage histogram of the scores, grouped by a price category that I defined myself:

  • up to 10 USD: Low
  • 11–40 USD: Medium
  • 41–100 USD: Expensive
  • more than 100 USD: Luxury
In [13]:
#Upper price limit (USD) of each category
lowOffset = 10
mediumOffset = 40
expensiveOffset = 100

#Function that maps a price to its price range
def priceDispacher(price):
    #a missing price (NaN) fails every comparison below and would otherwise be labeled 'Luxury'
    if pd.isna(price):
        return None
    if price <= lowOffset:
        val = 'Low'
    elif price <= mediumOffset:
        val = 'Medium'
    elif price <= expensiveOffset:
        val = 'Expensive'
    else:
        val = 'Luxury'
    return val

#Apply priceDispacher function to price column
df['priceDescription'] = df['price'].map(priceDispacher)
In [14]:
boxPricePoint = go.Figure()
boxPricePoint.add_trace(go.Box(x=df['points'], y=df['price'], orientation='v',marker_color='rgba(101, 109, 255, 1)', boxmean=True))

boxPricePoint.update_layout(xaxis_range=[79.5, 100.5], title='Price vs Points')

boxPricePoint.update_xaxes(title='Point', dtick=1)
boxPricePoint.update_yaxes(title='Price USD',type="log")
boxPricePoint.update_yaxes()

#update layout for export
"""
boxPricePoint.update_layout(
        font=dict(
        size=25),
        height=800,
        width=3000)
"""
boxPricePoint.show()

The box plot shows that wines with higher points tend to be more expensive, as could be expected, so there is a clear connection between the price and the points. This is confirmed by the following histogram, which also shows that the wines with the highest points are the most expensive.

In [15]:
averagepricePoint = px.histogram(df,x='points', color='priceDescription', barmode='stack', barnorm='percent',
 category_orders=dict(priceDescription=['Low', 'Medium', 'Expensive', 'Luxury']), title='Price distribution by points', labels={
                     "priceDescription": "Price Description"
                 }, color_discrete_sequence=px.colors.sequential.Teal
                 )

averagepricePoint.update_xaxes(title='Point', dtick=1)
averagepricePoint.update_yaxes(title='Count %')

#update layout for export
"""
averagepricePoint.update_layout(
        font=dict(
        size=25),
        height=800,
        width=3000)
"""

averagepricePoint.show()

It is curious to see that some wines have a very high price and very low points and, on the other hand, some wines have a very low price and very high points. This means that the price is not the only factor that influences the points.

Note: I tried to build a single figure with the previous two graphs sharing the x-axis, but this is currently not possible with plotly. Further information: https://community.plotly.com/t/how-to-set-barmode-for-individual-subplots/47931
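
The strength of the link between price and points can also be summarized with a single number. The sketch below computes a Spearman rank correlation as the Pearson correlation of the ranks; the `toy` values are made up for illustration and say nothing about the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real notebook would use df[['points', 'price']]
toy = pd.DataFrame({
    "points": [80, 85, 88, 90, 95, 91],
    "price": [10.0, 20.0, 15.0, 50.0, 300.0, np.nan],
})

# Drop the rows without a price, then correlate the ranks (Spearman's rho)
priced = toy.dropna(subset=["price"])
rho = priced["points"].rank().corr(priced["price"].rank())
print(f"Spearman rank correlation: {rho:.2f}")
```

A rank correlation is used here because the prices are heavily skewed (the plots above need a logarithmic axis), which makes a plain linear correlation on the raw prices hard to interpret.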

Reviewer distribution¶

Now it is time to see the distribution of the reviewers. I am interested in seeing how many reviewers there are and how many reviews each of them has done. I also want to see if there are some reviewers that are more reliable than others and if there are some reviewers that are more likely to review wines from a specific continent.

In [16]:
from itertools import product

tasterDistribution = make_subplots(rows=1, cols=3,subplot_titles=('Count',"Points distribution","Continent distribution"), shared_yaxes=True,horizontal_spacing = 0.01)

#Taster review count
trace1 = go.Histogram(y=df['taster_name'], name='Taster review count', marker_color='rgba(101, 109, 255, 1)')

#Point awarded
trace2 = go.Box(x=df['points'], y=df['taster_name'], name='Point awarded', orientation='h',marker_color='rgba(101, 109, 255, 1)' )

#Continent preference by taster
#groupby continent and taster and average
dfContinentTaster = df.groupby(['continent','taster_name']).size().reset_index(name='reviewPerContinent')
totReviewPerTaster = df.groupby(['taster_name'])['continent'].count().reset_index(name='totalReview')

##Merge the two dataframe into one##
#create a list of all the possible combination of taster and continent
combs = pd.DataFrame(list(product(df['continent'].unique(), df['taster_name'].unique())), 
                     columns=['continent', 'taster_name'])

#merge dfContinentTaster and combs for all the possible combination (goal: fill the missing value with 0)
dfContinentTaster = dfContinentTaster.merge(combs, how = 'right').fillna(0)

#finally merge with the total review per taster
dfContinentTaster = dfContinentTaster.merge(totReviewPerTaster, on='taster_name')

trace3 = go.Heatmap( x=dfContinentTaster['continent'], y=dfContinentTaster['taster_name'],z=(dfContinentTaster['reviewPerContinent']/dfContinentTaster['totalReview'])*100, name='Continent preference by taster', colorscale='Blues', colorbar=dict(title='Count %')) 

#create order by review count
order = df['taster_name'].value_counts().index 

#update layout
tasterDistribution.update_yaxes(categoryorder='array',categoryarray=order)
tasterDistribution.update_layout(showlegend=False, title='Taster review')

#layout for the first graph
tasterDistribution.update_xaxes(title='Count', row=1, col=1)
tasterDistribution.update_yaxes(title='Taster name', row=1, col=1)

#layout for the second graph
tasterDistribution.update_xaxes(title='Point awarded', dtick=2, range=[79.5, 100.5],row=1, col=2)

#layout for the third graph
tasterDistribution.update_xaxes(title='Continent', row=1, col=3)

#set background color

#add traces to the graph
tasterDistribution.add_trace(trace1, row=1, col=1)
tasterDistribution.add_trace(trace2, row=1, col=2)
tasterDistribution.add_trace(trace3, row=1, col=3)

#update layout for export
"""
tasterDistribution.update_layout(
        font=dict(
        size=25),
        height=1000,
        width=3000)
tasterDistribution.update_layout(title_font_size=1)
tasterDistribution.update_annotations(font_size=50)
"""

tasterDistribution.show()
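
The percentage of reviews per continent for each taster, which the cell above builds with `product` and two merges, could also be computed with `pandas.crosstab`. This is a minimal sketch on made-up reviews; the taster names are placeholders:

```python
import pandas as pd

# Hypothetical toy reviews; the names are placeholders, not real tasters
toy = pd.DataFrame({
    "taster_name": ["A", "A", "A", "A", "B", "B"],
    "continent": ["Europe", "Europe", "Europe", "Other",
                  "North America", "North America"],
})

# Share of each taster's reviews per continent, in percent; absent pairs become 0
share = pd.crosstab(toy["taster_name"], toy["continent"], normalize="index") * 100
print(share)
```

The result is already a taster-by-continent matrix, which is the shape a heatmap expects.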

There are different considerations to make:

  • There are 19 reviewers in total and some of them have written a huge number of reviews. For example, Roger Voss has more than 17k reviews, which is more than 15 reviews per day for 3 years.
  • The graph in the center shows the distribution of the points awarded by each reviewer. It is possible to see that the reviewers are very consistent in the points they give to the wines.
  • The graph on the right shows the preference of the reviewers for a specific continent. It is possible to see that the reviewers are more likely to review wines from their own continent (for example, Roger Voss and Kerin O'Keefe live in Europe, while Virginie Boone and Matt Kettmann live in North America).
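
One way to put numbers next to these observations is a per-taster summary of the points. The following is a minimal sketch on made-up reviews (the names are placeholders), followed by the arithmetic behind the reviews-per-day figure:

```python
import pandas as pd

# Hypothetical toy reviews; the names are placeholders, not real tasters
toy = pd.DataFrame({
    "taster_name": ["A", "A", "A", "B", "B"],
    "points": [88, 90, 92, 85, 95],
})

# Review count, mean and standard deviation of the points for each taster
stats = toy.groupby("taster_name")["points"].agg(
    reviews="count", mean_points="mean", std_points="std"
)
print(stats)

# Rough daily rate behind the figure quoted above: 17,000 reviews over three years
print(round(17000 / (3 * 365), 1))
```

A smaller standard deviation means that a taster's scores are more tightly clustered. That measures how narrow the scores are, which is related to, but not the same as, how reliable the taster is.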

Most used words in wine description for points¶

In this section I decided to represent the most used words in the description of the wines for each point. I used the description column to extract the words after a cleaning process.

In [17]:
#Most used words in wine description for each point
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import string
import re

import matplotlib as mpl
import matplotlib.pyplot as plt

import nltk.corpus
#nltk.download('stopwords')
from nltk.corpus import stopwords

#Build the stopword set once: calling stopwords.words() for every single word is very slow
STOPWORD_SET = set(stopwords.words('english'))
#Trivial words that the lowercase stopword list does not catch
TRIVIAL_WORDS = {'The', 'This', 'Wine', 'wine'}

#Function to clean the description
def cleanDescription(description):
    #remove punctuation (this already covers _ - / \ . , : ; ! ?)
    description = description.translate(str.maketrans('', '', string.punctuation))
    #remove numbers
    description = re.sub(r'\d+', '', description)
    #keep the words that are not stopwords, not trivial, longer than 2 characters
    #and without any remaining digit
    return [word for word in description.split()
            if word not in STOPWORD_SET
            and word not in TRIVIAL_WORDS
            and len(word) > 2
            and not any(c.isdigit() for c in word)]

#Function to create the wordcloud
def createWordCloud(description, title):
    #create wordcloud
    wordcloud = WordCloud(width = 500, height = 500,
                min_font_size = 10, 
                background_color ='white').generate(description) 
    # plot the WordCloud image                        
    #plt.figure(figsize = (25,25), facecolor = None) 
    plt.imshow(wordcloud) 
    plt.axis("off") 
    plt.tight_layout(pad = 0) 
    #plt.title(title, fontsize=50)
    plt.show()

#Function to create the wordcloud for each point
def createWordCloudForPoint(df, pointsDescription):
    #filter by point
    dfPoint = df.loc[df['pointsDescription'] == pointsDescription]
    #clean description
    dfPoint['description'] = dfPoint['description'].apply(lambda x: cleanDescription(x))
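    #join all the descriptions (separated by a space, otherwise the last word of a description is glued to the first word of the next one)
    description = ' '.join(' '.join(l) for l in dfPoint['description'].values)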

########################
#Remove the comment below to save in a dataframe the most used word for each point
#find most 10 used word in the description save it in a dataframe
#dfMostUsedWord = pd.DataFrame(description.split(), columns=['word']).word.value_counts().reset_index().rename(columns={'index':'word', 'word':'count'}).head(10)
#print(point)
#display(dfMostUsedWord)
########################
    #create wordcloud
    createWordCloud(description, 'Most used words of \'' + str(pointsDescription) + '\' category')

#remove warning
pd.set_option('mode.chained_assignment', None)
#Create wordcloud for each point
for point in set(df['pointsDescription']):
    print(point)
    createWordCloudForPoint(df, point)

#reset warning
pd.reset_option('mode.chained_assignment')
Acceptable
Good
Classic
Very good
Superb
Excellent
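
The word clouds give a visual impression. To list the most frequent words explicitly, the cleaned descriptions can be counted with `collections.Counter`. This is a minimal sketch on made-up word lists, assuming each description has already been cleaned and split into words:

```python
from collections import Counter

# Hypothetical cleaned descriptions, each already split into words
cleaned = [
    ["ripe", "cherry", "tannins"],
    ["cherry", "oak", "tannins"],
    ["cherry", "acidity"],
]

# Count every word across all the descriptions and list the most frequent ones
counts = Counter(word for words in cleaned for word in words)
print(counts.most_common(2))
```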